arXiv:l507.01454v 1 [math.ST] 6Jul2015 


arXiv: math.PR/0000000 


Principal Component Analysis of Persistent Homology 
Rank Functions with case studies of Spatial Point 
Patterns, Sphere Packing and Colloids 

Vanessa Robins*, Katharine Turner 

The Australian National University; The University of Chicago, 
e-mail: vanessa.robins@anu.edu.au; kate@math.uchicago.edu 

Abstract: Persistent homology, while ostensibly measuring changes in topology, captures multiscale geo¬ 
metrical information. It is a natural tool for the analysis of point patterns. In this paper we explore the 
statistical power of the (persistent homology) rank functions. For a point pattern X we construct a filtration 
of spaces by taking the union of balls of radius a centered on points in A, X a = U x ^xa). The rank 
function /3k(X) : {(a, b) 6 M 2 : a < b} —> R is then defined by /3k{X)(a,b) = rank (t* : Hk{X a ) —* F/fc(W)) 
where i* is the induced map on homology from the inclusion map on spaces. We consider the rank functions 
as lying in a Hilbert space and show that under reasonable conditions the rank functions from multiple 
simulations or experiments will lie in an affine subspace. This enables us to perform functional principal 
component analysis which we apply to experimental data from colloids at different effective temperatures 
and of sphere packings with different volume fractions. We also investigate the potential of rank functions in 
providing a test of complete spatial randomness of 2D point patterns using the distances to an empirically 
computed mean rank function of binomial point patterns in the unit square. 

Keywords and phrases: topological data analysis, persistent homology, Betti numbers, spatial point pat¬ 
terns, sphere packing, functional principal component analysis. 


Introduction 

Random point patterns arise in a wide variety of application areas from astrophysics to materials science to ecology 
to protein interactions. The points might represent galaxies, colloidal particles, locations of trees, or molecules, 
and their distribution in space is indicative of the underlying processes that created the pattern. When studying 
such systems a number of questions arise. For example, could a given pattern be generated from a purely random 
process with no underlying interactions between the objects (points)? If this null hypothesis can be rejected then 
we would like to say whether a proposed theoretical model generates patterns that are consistent with experimental 
observations. We might also need to compare a large number of point patterns and classify them into different 
groups, or quantitatively track changes over time to understand the dynamics of a system. 

Stochastic geometry provides various tools to characterise random spatial patterns and these have mostly focussed 
on first and second-order techniques analogous to means and variance in single-variable statistics [1]. There are a 
number of situations where more sensitive tests of structural difference are required [2]. Persistent homology is 
an algebraic topological tool developed for data analysis that is an intuitively appealing measure of higher-order 
structure and encompasses spatial correlations of all orders [3, 4]. This paper shows how to adapt persistent homology 
information to perform statistical analyses of spatial point patterns, such as principal component analysis (PCA). 

Topology is the study of spatial objects equivalent under continuous deformations — the old joke is that a 
topologist can’t tell the difference between a coffee mug and a bagel. The homology groups of a space X, Hk(X), 
k = 0, 1 , 2,. .. are algebraically quantified topological invariants that provide information about equivalent points, 
loops, and higher dimensional analogs of loops. Homology detects a ^-dimensional hole as a ^-dimensional loop 
(cycle) that does not bound a (k + l)-dimensional piece of the object. The ranks of the homology groups are called 
Betti numbers : /3q counts the number of connected components, /?i the number of independent non-bounding loops, 
/?2 the number of enclosed spaces in a three-dimensional object. A solid bagel and a coffee mug both have /3q = 1, 
/?i = 1, /?2 = 0, while their surfaces have /3q = 1, f3\ = 2, /?2 = 1. Homology groups and their Betti numbers are 
inherently global properties of an object that are sensitive to some geometric perturbations (tearing and gluing) 
and not to others (continuous deformation). 

Another topological invariant that has recently been used to summarise structure in point patterns is the Euler 
characteristic signature function [2, 5]. The Euler characteristic is the alternating sum of the Betti numbers (for 
three-dimensional objects, x — A) — At + Z^)- It is a topological invariant but also has measure-theoretic properties 
that make it more amenable to statistical analysis than the Betti numbers, but it is a less sensitive topological 
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invariant by definition. Nevertheless, the above studies have demonstrated that the Euler characteristic signature 
function is sensitive to short-range, higher-order correlations in point patterns. 

To determine the homology groups, the topological space must be represented by simple building blocks of points, 
line segments, surface patches, and so on, that are joined together in a specific way, i.e., as a cell complex. The 
homology groups are then defined via a boundary operator, d k , that maps each cell of dimension k onto the cells of 
dimension k — 1 in its boundary: 

dk : Ck C k ~i (0-1) 

The kernel of d k is called the cycle group Z k and the image of d k +i is called the boundary group B k . All boundaries 
are cycles, so we can form the homology group as the quotient H k = Z k /B k . See the text by Hatcher for a 
comprehensive treatment of homology theory [6], or [7] for a concise overview aimed at physicists. 

When examining point data, a parameter must be introduced to define which points are connected to one another 
and so build the cell complex. A crucial lesson learnt in the early days of topological data analysis is that instead of 
trying to find a single best value for this parameter, much more is learned by looking at how the homology evolves 
over a sequence of parameter values. Rather than working with single cell complex we work with a filtration , a 
family of spaces K a such that K a C K & whenever a < b. Often this parameter a is a length scale, so that although 
we are measuring topological quantities, the way these change tells us about the geometrical features of the data 
set. For example, if we used a filtration of M 3 defined by lower level sets of the distance function to the surface of 
an ideally-smooth bagel, we would be able to read the radius of the hole of the bagel from the function /3i(K a ) as 
it is at that radius that the space of loops changes from one to zero dimensional. If the bagel is a real one with 
irregular cross-section, bumps and splits, the Betti number function /3i(K a ) will not be a clean step function as it 
is in the ideal case, but may jump around and obscure the exact point of the large-scale change from bagel to blob. 

The issue of topological noise, i.e., the lack of stability of the Betti numbers, motivated the development of 
persistent homology in the 1990s [8, 9, 10]. The inclusion of K a C for a < b induces a homomorphism between 
the homology groups H k (K a ) and H k {Kif) that tells us which topological features persist from K a to and which 
disappear (i.e., get filled in). The persistent homology group is the image of H k (K a ) in H k {Kif), it encodes the 
k- cycles in K a that are independent with respect to boundaries in K k : 

H k {a , b) := Z k (K a )/(B k (K b ) D Z k (K a )). (0.2) 

Algorithms for computing persistent homology from a given filtration are quite simple in their most basic form [10, 
11], and are now implemented efficiently in a number of freely-downloadable packages [12, 13, 14, 15]. 

The two most common ways of representing persistent homology information are the barcode [16] and the 
persistence diagram [17]. The barcode is a collection of intervals [6, d) each representing the birth, 6, and death, d, 
values of a persistent homology class. Equivalently, the persistence diagram is a set of points (6, d) in the plane. 
The main problem with these objects is that they are very difficult to work with statistically. Distances between 
persistence diagrams and definitions of their means and variance require advanced analytical techniques [18, 19, 20]. 

In this paper we return to the above definition of persistent homology groups and quantify them via their rank, 

/3 k (a,b) := rankid/ c (a, 6), for a < b. (0.3) 

This persistent homology rank function is an integer-valued function of two real variables and can be thought of as 
a cumulative distribution function of the persistence diagram. Since the persistent homology rank function is just a 
function we can apply standard statistical techniques to analyse distributions of them. This rank function is related 
to the size function [8] and has also been defined for multidimensional persistence, in the case of filtrations that 
are built using two or more parameters [21, 22]. Other functional summaries of persistence diagrams have also been 
proposed recently and termed persistence landscapes and silhouettes , see [23, 24]. The persistence landscapes A k (i, t) 
are a family of functions so that for each i = 1, 2, 3,..., A k (i,t) is a function of a single variable (the subscript k is 
the homology dimension as above). The variable t is related to the persistence diagram coordinates by t = (b + d)/ 2. 
We argue that the rank functions used here, /3&(a, 6), retain a more direct connection to the geometry and topology 
of the original data, although they are functions of two variables. In particular they are more suitable for analyzing 
distributions of local point configurations. 

This functional approach to persistent homology (either using landscapes or rank functions) greatly simplifies 
the business of “doing statistics” with persistent homology. It provides a framework where it is simple to compute 
averages and variances of these topological signatures from many data sets generated by a particular system, and 
to make statistically rigorous statements about whether an experimentally-observed pattern is compatible with a 
theoretical model. In particular the pointwise average of f3 k (a,b) is a function on M 2+ which tells us the expected 
ranks of the corresponding persistent homology groups. We can also perform functional principal component analysis 
(PCA) where the principal component functions are also functions on M 2+ containing topological information. 
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In section 1 we provide the necessary background to define the persistent homology rank function (from now 
on shortened to rank function). We then introduce a Hilbert space of functions in which the rank functions exist 
in section 2. In section 4 we review the definition of functional PC A and explain how to compute the principal 
component functions and scores using the dot product matrix derived from a set of functions. 

The applications we consider in this paper all pertain to point processes in some way. In section 3, we propose 
a test for whether point processes are completely spatially random using the distribution of distances to the mean 
persistent homology rank function and compare its power to other standard tests for a variety of models of point 
processes. The final two sections apply functional PCA to real world data sets. We first analyze the distributions 
of colloidal particles at a fluid interface from an experiment designed to test theories of melting of two-dimensional 
crystals. The first principal component explains much of the variation (> 96% for dimension 0 and > 86% in 
dimension 1). Furthermore, it separates the colloids into the separate phase groups. We then study random sphere 
packings using the locations of centers of the spheres derived from x-ray images and show that in both dimension 1 
and dimension 2 persistent homology rank functions, the first principal component explains > 97% of the variation 
and is highly related to volume fraction of the packings. Persistent homology gives topological information about 
the distributions of local arrangements for more and less efficient packings. 


1. Persistent homology rank functions 

To define persistent homology rank functions we will need to first provide some background theory on homology 
and persistent homology groups. The simplest homology theory to define is simplicial homology, whose ingredients 
are a topological space defined by a simplicial complex and a coefficient group, which for persistent homology 
computations is usually the field Z 2 . 

A k-simplex is the convex hull of k + 1 affinely independent points vo,vi ,.. .Vk and is denoted 
For example, the 0-simplex [t; 0 ] is the vertex vq, the 1-simplex [vq,vi] is the edge between the vertices vq and v\ 
and the 2 simplex [r>o, rq, ^ 2 ] is the triangle bordered by the edges [vo,vi], [^ 1 ,^ 2 ] and [^ 0 ,^ 2 ]- Technically, there is 
an orientation on simplices. If r is a permutation then [^o,^i,... ,f/c] = (—l) sgn %) [^(op W(i)> • • • ? v r(k)\- However, 
if we are considering homology over Z 2 then 1 = — 1 and we can ignore orientation. 

We call [uq,ui, .. .Uj\ & face of [vq,vi, .. .Vk\ if {uq,ui, ... Uj} C {vo, ^ 1 , • • • Vk}- A simplicial complex IT is a 
countable set of simplices such that 

• Every face of a simplex in K is also in K. 

• If two simplices <j \, 02 are in K then their intersection is either empty or a face of both g\ and 02 . 

Given a finite simplicial complex iT, a simplicial k-chain is a formal linear combination (with coefficients in our 
field of choice) of ^-simplices in K. The set of fc-chains forms a vector space Ck{K). We define the boundary map 
dk : C k (K) -A Ck-i{K) by setting 

k k 

dk([vo,vi, • • •Vk ]) = Yl-mvo, • • - FT • • -Vk\ = • • - v k\ 

3 =0 j =0 

for each /c-simplex and extending to ^-chains using linearly. The second equality follows because our coefficient 
group is Z 2 where —1 = 1 . 

Elements of B^(K) = imdk+i are called boundaries and elements of Z^{K) = kerd/c are called cycles. Direct 
computation shows dk+i 0 dk = 0 and hence Bk(K) C Zk(K). This means we can define the k th homology group of 
K to be 

H k (K) := Z k (K)/B k (K). 

A comprehensive introduction to homology can be found in [ 6 ]. 

We now consider the definition of persistent homology groups of a filtration. A filtration K = {K r \r E 1} is a 
countable simplicial complex indexed over the real numbers such that each K a is a simplicial complex and K a C K 5 
for a < b. We wish to describe how the topology of the filtration changes as the parameter increases. For a < b we 
have an inclusion map of simplicial complexes t : K a -A K & that induces inclusion maps 

l : Bk(K a ) -A Bk{K]f) and i : Zj^(K a ) -A- Z^Kjf). 

These inclusions induce homomorphisms (which are generally not inclusions) on the homology groups: 

i a A b : H k (K a ) -> H k (K b ). 
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The image of t^ b consists of equivalence classes of cycles that were present in K a , where the homological equivalence 
is measured with respect to boundaries in K 5 . We therefore define the persistent homology group for a < b as 

H k (a, b) = t(Z k (K a ))/(B k (K b ) n t(Z k (K a ))). 

It is worth observing that Hk(a,a) = Hk(K a ) and hence /3k(a,a) = /3k(K a ) which motivates the use of the symbol 
f3 for the rank function as it is a generalization of the traditional Betti numbers. 

Let M 2+ := {(x,y) G (—00 U R) x (R U 00 ) : x < y}. We define the k- th dimensional persistent homology rank 
function corresponding to the filtration K to be 

Pk{K) : R 2+ —»■ Z 

(a, b) ^ dim b ) 

There are two alternative arguments for the name “rank”. Firstly, f3k(K)(a,b) is the rank of H^(a^b) viewed as a 
module over Z 2 . Alternatively, since Hk(a,b) is isomorphic to t^ b (Hk(K a )) we know that /3k(K)(a,b) is the rank 
of the function i^ b : Hk(K a ) —X Hk{K^). 

We call a filtration tame if dim Hk(a, b) is finite for all a < b. From now on we will restrict our attention to tame 
filtrations (and hence everywhere finite valued rank functions). This restriction will be satisfied by any application 
involving finite data. 

The rank function contains the same information as the persistence diagrams and barcodes used frequently 
in computational topology literature. The persistence diagram in dimension denoted PD&, consists of points 
(b(a),d(a)) G M 2+ where b(a) is the birth value of a homology class and d(a) is the death value of the filtration 
parameter. We say that a homology class a G Hk(K h ^) is born at b(a) if it is in the cokernel of L s ^ h ^ f or an y 
s < b(a). That is, a, is not the image of any cycle that occurs earlier in the filtration. The homology class of a 
dies at d(<a) if for b(a) < t < d(a) we have a ^ kerq^ 0 ^ - ^ but a G ker^ a ^ d ^. Informally we can think of the 
process of dying as the cycle becoming a boundary or two cycles merging. The difference d(a) — b(a) is termed the 
persistence of the cycle a. Some cycles can have d(a) = 00 , these are called essential classes. 

Each rank function can be thought of as a cumulative function of a persistence diagram. This is because as the 
quantity /3k(a,b) is the same as the number of points in the persistence diagram in the region (— 00 , a] x [ 6 , 0 c). 
The correspondence between persistence diagrams and rank functions makes it easy to compute the persistence 
diagrams from rank functions and vice versa. This is convenient as there are many available computer packages to 
compute persistence diagrams [12, 13, 14, 15]. 

Rank functions have some nice monotonicity properties [17, 21]. If /? is a persistent homology rank function then 
for a < c < b < d 

/3(c, b) - /3(a, b) - /3(c, d) + /3(a, d) > 0. (1.1) 

This is because the inclusion exclusion principle tells us that /3(c, b) — /3(a, b) — /3(c, d) +/3(a, d) should be the number 
of points in the corresponding persistence diagram lying in the region (a, c] x [ 6 , d). This is depicted in Figure 1 



Fig 1. (3(c,b) — (3 (a, b) — /3(c, d) + /3(a, d) for the rank function (3 is the number of points in the corresponding persistence diagram lying 
in the shaded region 

A common way to produce filtrations is to consider sublevel or superlevel sets of a continuous function. Given 
/ : I G R we can set K t = / - 1 (— 00 , £] for a filtration by sublevel sets. The filtration by superlevel sets of / 
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is effectively the same as the filtration by sublevel sets of —/. Combinatorial cell complexes derived from scalar 
functions with non-degenerate critical points are called Morse complexes. An important example that we will focus 
on in this paper is the distance function to a set of points. The filtration induced by the distance to a set of points 
is modelled by a simplicial complex called the Cech complex. Let S be the set of points {t’l,... vx} Cl d . The Cech 
filtration, {C(S, r) : r > 0 }, is the filtration over the complete TV-simplex where 

[Vi x , • • •, V ik ] e C(S) r whenever n*Lj B{v ij , r) ^ 0. 

The Nerve Lemma [25] states that C(S,r) is homotopic to U ve sB(v,r). This means the Cech filtration has the 
same persistent homology as the filtration by sub level sets of the distance function. Computationally, the Cech 
filtration can be efficiently implemented for points in R 2 and R 3 by alpha shape subcomplexes of the Delaunay 
triangulation [26]. 

Another common alternative filtration for point data used in Topological Data Analysis is the Rips filtration. 
The Rips filtration, {7 Z(S,r) : r > 0}, is the filtration of the complete TV-simplex where 

[v^,. ..,v ik \ G C(S) r whenever < r for all j,l. 

It is the flag complex whose 1-skeleta agree with that of the Cech complex. The Rips complex is often used in 
high-dimensional computations as it only requires knowledge of pairwise distances. Unfortunately it loses subtle 
local geometric structure; for example, it cannot distinguish triangles and tetrahedra whose vertices are nearest- 
neighbour s. 


2. Rank functions are a subset of a Hilbert space of functions 

Persistent homology rank functions he in the space of real valued functions from the Riemannian metric space 
(R 2 + ,g) to R. We consider a metric d(-, •) on this space of functions defined by 

d(f,h) 2 = f (/ - h) 2 dy (2.1) 

J R 2 + 

where pt is the measure on R 2+ corresponding to the metric g on R 2+ . There are many choices of the metric g and 
hence also its corresponding measure ft. A potential problem is that depending on the choice of /q the pairwise 
distances between rank functions may be infinite. For example, if we choose pt to be the standard metric for the 
plane restricted to R 2+ , (i.e., Lesbesgue measure A) the distance between two rank functions can only be finite 
when their respective sets of essential cycles have identical birth times. Otherwise, there is an infinite region in 
the plane where the rank functions differ and the Lesbesgue measure of this region is infinite. To address this 
issue, we consider different measures on R 2+ obtained by multiplying Lesbesgue measure by a weighting function, 
fjt((x,y)) = (j){y — x)X((x,y)). Making (j) a function of (y — x) means the measure pt is a function of the persistence 
or lifetime of each homology class. Given a weighting function <fi the distance function is then 

d<j>{f, h) 2 = f (f - h) 2 <p(y - x)dxdy. ( 2 . 2 ) 

J x<y 

The choice of weighting function is analogous to the choices made in multivariate PC A where rescaling the original 
coordinates (by changing the units of measure) will affect the PCA components. 

The following Lemma establishes sufficient conditions to ensure rank functions are a finite distance apart. Then 
given a set of rank functions, each pairwise a finite distance apart, we can consider the affine space that they he in 
and so construct means and principal components. 

Lemma 2.1. Let <fi : [0, oo) —>• [0, oo) such that J 0 °° <j>(t) dt < oc. Let fx and /l be rank functions constructed from 
filtrations K t and L t . If and L ^ are finite simplicial complexes with H^K-qq) = 1L*(L_oo) and H^Kqq) = 
H^Lqo) then d^fx^h) < oc. 

Proof. Let Xx and Xl be the persistence diagrams corresponding to fx and f^. Since and L 00 are finite 
simplicial complexes there exists finite m and M such that every point in the persistence diagrams is of the form 
(—oc, 6 ), (—oc, oc), (a, b ) or (a, oc) where m < a, b < M. There also exists an TV such that 0 < fx(%, y), /l(^, y) < TV 
for all (x, y) G R 2+ . 

The assumption that H*(K- oo) = H*(L_ oo) implies that Xx and Xl have a number of points whose first 
coordinate is —oc. This implies that the fx{%,y) = /l(^? 2 /) whenever y < m. Similarly fx(x,y) = /^(x, y) 
whenever x > M. 
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d(fK, /i) 2 = (Ik - /i) 2 0(y - x) dx dy 
J x<y 

< / N 2 (j)(y — x) dx dy 

J {(x,y):x<y,y>m,x<M} 

< OO 

□ 

The sufficient conditions expressed in Lemma 2.1 are by no means exhaustive. However, practically speaking, 
whenever the input is finite (or from a finite discretization) and the starting and ending stages of the filtrations 
are the same, the rank functions will he in the same affine space and this will cover almost any real application. In 
particular, rank functions computed on the Cech complexes of point patterns, the colloid and the sphere packing 
data all satisfy the conditions in Lemma 2.1. 

A scenario not covered by Lemma 2.1 is if we considered rank functions constructed from the Cech complexes 
on two point patterns but one of the point patterns lived in the unit square and the other lived in the flat torus. In 
this example the pairwise distance will be infinite. 

Now fix a function h : R 2+ —)> R and consider the space of functions, denoted A(h), that are a finite distance from 
h. This is an affine space that can be transformed to a vector space, V(h), by centering about h\ V(h) = {/ — h : 
/ G A(h)}. We can put an inner product structure on V(h) using 

(vi,v 2 ) = / v 1 v 2 dp. 

J R 2 + 

We will generally use the mean persistent rank function for centering. Let fd 1 , /3 2 ,... /3 n be k- dimensional rank 
functions. We can define a mean persistent homology rank function simply as 

n 

P(a,b ) = -V ]P z (a,b) 

i=i 

for all (a, b ) G R 2+ . This definition retains topological meaning because /3(a, b ) is the expected dimension of Hk(a, b). 
Similarly, we can define a mean persistent homology rank function over a distribution of rank functions by taking 
the appropriate pointwise defined integrals. The mean rank function is unlikely to be a persistent homology function 
as it will probably have some non-integer values. However it is easy to check that it will still retain the monotonicity 
properties of persistent homology functions (1.1). 

The following corollary holds in more general contexts but is sufficient for our needs. 

Corollary 2.2. Let X be a manifold with bounded curvature and finite dimensional homology (such as X = R d ). 
Let {/3 1 , /3 2 , /3 3 ,... /3 n } either be the set of persistent homology rank functions from the Cech filtrations of different 
finite point clouds in R d or from the filtrations of sub level sets of kernel density functions from different finite point 
clouds in X. Let fd be the (pointwise) mean of the fd l . Then the {yd 1 , yd 2 , ..., /d n } C A{jd) and {jd 1 —yd, yd 2 —yd, ...fd n — /d} 
lie in an at most (n — 1)-dimensional vector space. 


3. Application: Testing whether point patterns are completely spatially random 

One of the basic questions in the field of spatial point process analysis is testing the hypothesis that a point 
process comes from some particular model. The method we outline here can be used for any model (after fixing all 
parameters) but we have restricted ourselves to the case of testing whether point patterns are completely spatially 
random — a common null hypothesis. A spatial point process is completely spatially random (CSR) if there is 
no interaction between the locations of the points. The two main CSR models are (homogeneous) Poisson point 
processes and (homogeneous) Binomial point processes. 

Definition 3.1. Let / be a probability density function on a set A and let n G N. A point process X consisting 
of n independent and identically distributed points with common density / is called a binomial point process of n 
points in A with density /. If / is constant then the binomial point process is called homogeneous and is CSR. 

Definition 3.2. A point process X on A is a Poisson point process with intensity function p (and intensity measure 
p) if the following two properties hold: 
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• For all 5ci, the number of points in X lying in B C A follows a Poisson distribution with mean ft(B). 

• For B, C C A with empty intersection, the number of points in X lying in B and C are independent. 

These two CSR models are intimately related — a Poisson distribution is a binomial point process X of P(A) 
points, where P( A) is a drawn from a Poisson distribution. 

Here we propose a test for point patterns being CSR using persistent homology rank functions. The persistent 
rank functions corresponding to the dimensions 0 and 1 each provide a different test. The first step is to estimate 
empirically the mean persistent rank function /3k- Then using a new set of CSR patterns we estimate the distribution 
of distances squared to /3&. We then determine a cutoff Ck depending on a chosen p-value (i.e., confidence level). 
New point patterns are then deemed not CSR if their distance to fik is more that Ck- 

We used this test with point patterns drawn from a variety of point process models to demonstrate its power. 
The same point patterns were also tested using some standard CSR tests for the sake of comparison and the results 
are presented in Table 3. The Completely Spatially Random model used for each test is a set of 100 i.i.d. points 
patterns in the unit square. The estimated mean persistence rank functions /^o, were computed using 300 CSR 
point patterns. The distributions of the distances squared to fik were computed using an additional independent 
set of 200 CSR point patterns. These provided cutoffs of Co = 0.05 and C\ = 0.01 corresponding to the p-value of 
0.05. 



Fig 2. (a)The mean Ho persistent rank function of 300 i.i.d. CSR point patterns of 100 points in the unit square, (b) The distances 
squared of 200 independent CSR patterns from the mean rank function depicted in (a). 



(a) (b) 

Fig 3. (a)The mean H\ persistent rank function of 300 i.i.d. CSR point patterns of 100 points in the unit square, (b) The distances 
squared of 200 independent CSR patterns from the mean rank function depicted in (a). 


We tested 100 point patterns each from a variety spatial point processes in the unit square, each conditioned on 
the number of points being exactly 100 (to match the Binomial CSR model). The four models are another set of 
CSR patterns, Baddeley-Silvermann, the Strauss pairwise-interaction model and a Matern clustering model [27]. 
We then compare our results with those given by other standard tests on the same set of simulated point patterns. 
These other tests were the Diggle-Cressie-Loosmore-Ford test (DCLF), Maximum Absolute Deviation test (mad), 
quadrant test (quad) and Clark and Evans test (Clark Evans) all available in the ‘spatstat’ R package [27]. 


imsart-generic ver. 2014/07/30 file: PCAwithPHF-arxiv.tex date: July 7, 2015 






















Vanessa Robins and Katharine Turner/PC A of Persistent homology rank functions 



CSR 

Strauss 

Matern Cluster 

Baddeley-Silverman 

DCLF 

2 

1 

98 

0 

Mad 

0 

0 

98 

0 

Quad 

3 

10 

97 

1 

Clark Evans 

5 

56 

67 

96 

dim 0 

3 

73 

14 

99 

dim 1 

7 

17 

59 

81 


Fig 4. The number of point patterns (out if 100 testing) for which we rejected the null hypothesis that they were CSR. Here we compare 
the results from the persistent homology rank test we constructed with some standard tests available in the R-package spatstat [27]. 


We now will give a brief overview of theses models and all the parameters we chose. 

The Strauss process with interaction radius R and parameters /3 and 7 is the pairwise interaction point process 
with probability density 

f{x 1 ,...,x n ) = ^p< x h s[x) 

where {aq,..., x n } represent the points of the pattern, n(x ) is the number of points in the pattern, s(x ) is the 
number of distinct unordered pairs of points that are closer than R units apart, and is the normalizing constant. 
By conditioning on a fixed number of points the parameter [3 is absorbed into the normalizing constant. We used 
R = 0.05 and 7 = 0.5. 

The Matern clustering model we used is as follows. First determine the location of the “parent” points over 
the plane by a Poisson distribution with intensity 10. Then each parent point is replaced by a random cluster of 
“offspring” points, the number of points per cluster being Poisson( 10) distributed, and their positions being placed 
and uniformly inside a disc of radius 0.02 on the parent point. We then discard everything outside the unit square. 
We conditioned on 100 points in total (by redrawing when not 100 points). 

The Baddley-Silvermann spatial point process model provides a counterexample that CSR can be determined 
by the distribution of pairwise distances. Divided the space into 100 equal rectangular tiles. In each tile, a random 
number of random points is placed. There are either 0,1 or 10 points, with probabilities 1/10, 8/9 and 1/90 respec¬ 
tively. The points within a tile are independent and uniformly distributed in that tile, and the numbers of points 
in different tiles are independent random integers. Unlike the normal Baddley-Silvermann, we then conditioned on 
100 points in total (by redrawing when not 100 points). 

The Monte-Carlo simulations show that even when the number of points within the point patterns involved is 
small (here 100 ), the rank functions can be used as a test for complete spatial randomness. As could be expected, 
the power of the test is much higher for repulsive models than clustering models. The power of the tests should 
improve as the number of points increases. The intuition behind this claim is twofold. Firstly, a functional version 
of the central limit theorem would imply that the variation of the corresponding distributions of the persistent rank 
functions will decrease. In other words the distributions of persistent rank functions of each model will concentrate 
around the means (both for CSR and the other models) implying both that the cutoff radius will decrease and that 
the measure of the alternative model in that ball will decrease. Secondly, the effect of the boundary will decrease. 
This should be particularly true for the Matern Clustering model where inspection shows that there are a significant 
proportion of larger voids which do not appear as Hi persistent homology classes because they appear near the 
boundary of the unit square. 
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Fig 5. (a), (c), (e) are the mean Ho persistent rank function of 100 i.i.d. spatial point patterns from the models of Strauss, Matern 
clustering and Baddeley-Silvermann, respectively, (b), (d) and (f) are the corresponding distributions of distances squared to j3o. 
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(e) (f) 

Fig 6. (a), (c), (e) are the mean H\ persistent rank function of 100 i.i.d. spatial point patterns from the models of Strauss, Matern 
clustering and Baddeley-Silvermann, respectively, (b), (d) and (f) are the corresponding distributions of distances squared to (3\. 


imsart-generic ver. 2014/07/30 file: PCAwithPHF-arxiv.tex date: July 7, 2015 























Vanessa Robins and Katharine Turner/PCA of Persistent homology rank functions 


11 


4. Functional Principal Component Analysis 

In this section we give a general outline of functional principal component analysis and how to compute it using 
the inner product matrix of a set of centered functions. This is a standard technique but is included here for 
completeness and for accessibility for those from other disciplines. 

Given a set of functions {fi, f 2 , •.., / n }, whose mean is zero (constructed by just subtracting the mean from each 
of the functions in the initial set) we can perform functional principal component analysis. Functional PCA gives a 
sequence of principal component weight functions Cl ? C2 , - - - , Cn-i and principal component weight scores s^. These 
can be defined inductively. From now on let (,) denote an inner product on the space of functions V. 

The first principal component weight function, (j, is defined to be the norm one function that maximizes 
Sr=i(Ci> fi) m We define Q to be the function that maximizes 5 Hi=i(Cb fi) subject to the constraint that Q is 
a norm one function which is orthogonal to Cl: C2, - - -, (j- 1- We define Sij := (Q, fi). 

Since the set of functions / 1 , f< 2 ,... f n lies in an n — 1 dimensional subspace of the space of functions we know 
fi = s ijCj- However, since the principal component weight functions are ordered to be in the directions 

of greatest variance, approximating each function fi by a linear combination of the first k principal component 
weight functions provides the best ^-dimensional representation of the sample functions. In other words, we can 
approximate fi by the partial sums 1 s ijCj- How well these partial sums approximate the original data is often 
expressed in the proportion of explained variance. If A j is the variance of the scores (i.e. A j = Y^i=i s ij) 

then total variance is an d hence the proportion of variance explained by the projection into the first k 

coordinates is 



When working with finite dimensional data, PCA proceeds by first creating a matrix X where each row is a data 
point and each column is a coordinate axis (i.e., a variable). The principal components are then the eigenvectors of 
X T X ordered by the largest corresponding eigenvalue. These eigenvalues are also the variances of the corresponding 
sets of scores. A simple adaption of PCA to functional data represents each function by its values at a fixed, 
finite number of evenly spaced coordinates; then each function becomes a row in the matrix X and each column 
is a coordinate where the functions are evaluated. The matrix product X T X = (XX T ) T has entries that are 
approximations to the Lebesgue inner product between functions. In this paper, however, we are working with 
functions whose inner product is defined with respect to a weighting function (2.2) so we must use a slightly 
different adaption of the standard PCA matrix representation, interpreting the matrices as a linear operators as 
follows. 

Given a set of n rank functions, {/ 1 , / 2 , • • • / n }, define the linear operators 

X : L 2 (M 2+ ,£) 

9*-+ {{g,h),{9,h),---,{9,fn)) 


and 


X T : K n L 2 (R 2 +,g) 

n 

(ai,a 2 , ...,a n )-> 'P/kfi- 

A matrix representation of XX T acts on elements of M n with the (i,j) entry being (fi, ff). The non-zero eigenvalues 
and eigenvectors of X T X can then be found from the eigenvalues and eigenvectors of XX T because for w G M n , 

XX T w = Xw =>► (X T X)X T w = XX T w. 

Thus, to find the principal component weight functions £ 1 , (" 2 , • • •, Cn we first compute the eigenvalues and eigen¬ 
vectors Ai,... A n and uq, ... w n of the matrix of inner products ((fi, fj)) 7 f j = x and then set (k to be the unit-norm 
function that is a scalar multiple of X T 

4-1- Outline of algorithm 

We will now outline the process of computing PCA of discretized rank functions from a set of persistence diagrams. 
We start with persistence diagrams as there are many standard algorithms to compute persistence diagrams. The 
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first step is to transform the persistence diagrams to rank functions. Discretize the plane by choosing a lattice over a 
relevant region over the plane above the diagonal. This provides a list of coordinates {(jq, yi), (jq, 3 / 2 ), • • • (%,1/m)}- 
Each rank function will be written as a length M vector. 

Given a persistence diagram we want to compute the corresponding rank function. Start with the zero vector of 
length M. Then iterate through the points in a persistence diagram: for each point (a, b ) in the diagram add 1 to 
coordinate l when with a < xi < yi < b. These are all the coordinates is the lower right triangle between (a, b) and 
the diagonal. 

Given the set of discretized persistence rank functions, written as vectors {vi, tq,... vn}), the algorithm for 
computing the mean and principal component weight functions is the following recipe; 

1. Calculate the mean v of {rq, tq, • • • vn} coordinate wise (at each sample point (x,y)). 

2. Construct new centered arrays {hi, £ 2 ,... vn} by Vi = Vi — v 

3. Calculate the dot product matrix D = ((vi,Vj))i,j using the weight function as specified in (2.2). 

4. Perform an eigenvalue/eigenvector decomposition of D, ordered by the size of the eigenvalues (largest first). 

5. Let A j and wj = (cq, c&2,... a/v) be an eigenvalue/eigenvector pair of D. Xj is the variance of the scores for the 

jth principal component. The j th principal component is the unit-normed function in the direction JT =1 a i^i- 
Thus the jth principal component is Q = ^ J2iLi a i^i where 

N N N 

i= 1 i=l 1=1 

6. We then can then reinterpret Q as the discretized function values evaluated at the sample points {(#/, 2 /z)}z=i,...m- 

Once we have computed as many principal component functions as desired we then compute the corresponding 
scores, i.e. projections of the (centered) rank functions onto the k largest principal components. These scores are 
s ij = {(jiVi)i he., dot products of the rank functions with the principal component functions. Alternatively they 
can be computed directly from the dot product matrix; 

. _ l^m=\ a m- u m,i 

VAAi Eili afaiPfi 


5. Analysis of Colloids 

We now apply the PC A techniques to two-dimensional point patterns obtained from images of colloids in an 
experimental model designed to test theories of melting in 2D crystals [28]. 1 The experimental system consists of 
paramagnetic colloidal particles of diameter a few microns, suspended in water and fixed by gravity to sit at the 
air-water interface of a hanging droplet. An applied magnetic field induces a repulsive dipole-dipole interaction 
between the particles, and the (inverse of) applied field strength plays the role of temperature. At strong field 
strengths the colloids arrange themselves in a hexagonally close-packed crystal with no defects, and as the field 
strength decreases the crystal melts, passing through two continuous phase transitions to an isotropic liquid phase. 
The intermediate hexatic phase retains the long-range 6-fold orientational order of the crystal but not its quasi-long- 
range translational order. This two stage melting of 2D crystals is described by a theory of Kosterlitz, Thouless, 
Halperin, Nelson, and Young (KTHNY) that explains it through the dissociation of pairs of defects that form close 
to the melting point. 

Extensive studies of the point patterns formed by the colloid centres [29] have demonstrated that the bond 
orientational correlation function, and bond orientational susceptibility parameter are sensitive to the transition 
between hexatic and isotropic liquid phases, as these quantities explicitly test for long-range 6-fold orientational 
order. The transition between hexatic and crystalline phases should in principal be detected by periodicity in the 
density (or two-point) correlation function, but in practice this is found to be insufficient as 2D crystals have 
a logarithmic divergence in their displacement autocorrelation function. Instead, the melting transition is better 
detected by the “dynamic 2D Lindeman parameter” which is a measure of a particle’s displacement with respect 
to its nearest neighbours as a function of time. This quantity stays bounded in a crystal (if no grain boundaries are 
present) but diverges in the hexatic and isotropic liquid phases. The work in [29] also shows that other quantities 
that measure purely local structural properties, such as the Voronoi cell shape factor or Minkowski functionals, do 
not provide a sensitive test of the phase transitions. 

1 We thank Peter Keim for giving us permission to use this data. 
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Ideally, it would be useful to find a single order parameter that changes at both phase transitions. The persistent 
rank functions of alpha shapes are a possible candidate as they are sensitive both to the local arrangements of 
particles, but also naturally incorporate higher-order point correlations and non-local topological information. The 
data presented here is an initial investigation that illustrates the basic properties of persistent homology signatures 
of colloidal point patterns. 

The data used are four groups of seven snapshots of around 800 colloidal particles; each group is taken from a 
different effective temperature, T = 0.013,0.014.0.015.0.016. These images were taken from the centre of a droplet 
that contained around 10 5 particles in an equilibrium state. Example point patterns and their 1-dimensional per¬ 
sistence diagrams are shown in Fig. 7. The filtrations are derived from the 2D alpha-shape of the colloid centres. 
Samples at T = 0.013 and 0.014 are crystalline but close to the melting transition, while those with T = 0.015 are 
in the hexatic phase and T = 0.016 are isotropic liquid. 

The principal component analysis of the persistent rank functions for these 28 point patterns shows that 96.7% of 
variation in 0-dimensional homology is explained in the first component, 97.9% in the first two components, and 99% 
in the first five. For 1-dimensional homology the variation explained is lower, 86.4%, and 92.0% reaching 99% with 15 
components. The difference in variation explained here possibly reflects the fact that the overall density of colloids 
does not change, meaning the 0-dimensional homology information is more consistent across the different effective 
temperatures, while the appearance of dislocation pairs (from 6,6 to 5,7 nearest neighbours) as T increases, means 
1-dimensional homology changes more significantly. From the plots of principal component scores in Fig. 9, we see 
that in both dimensions, the first p.c. scores cluster the point patterns into their different phases quite effectively. 
The point patterns for T = 0.013, 0.014 are noticeably closer together than to the other two temperature groups, 
consistent with them both being in the crystalline phase. The second principal component scores in dimension 0 
do nothing to help with the clustering, suggesting that the extra variation explained is physically irrelevant. The 
second p.c. scores in dimension 1 do seem to provide a little more information, and suggest that one each of the point 
patterns from T = 0.014, 0.015 might be anomalous in some way. A plot of first p.c. scores from dimension 0 versus 
dimension 1 (Fig. 10) shows a roughly linear relationship, and nice clustering again of the different temperature 
groups. 

In conclusion, the variation in persistence diagrams for these colloid data is difficult to detect by eye, but the 
PC A analysis has successfully clustered them into their temperature groups. The basic persistent homology data as 
presented here appears to change continuously with the effective temperature, and is unlikely to be a sensitive test 
of the critical phase transition point. However, it may be possible to derive a quantity from the persistence analysis 
that will be sensitive, or to make a persistence analysis of a different function derived from the point patterns. 
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Fig 7. Point patterns formed by colloidal particles (left column) and their corresponding 1-dimensional persistence diagrams (right 
column). One example from each group of seven snapshots is shown, from top to bottom the effective temperature increases from 
0.013,0.014,0.015.0.016. Units of length are micrometers. 
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Fig 8. (a) Mean rank function for /3i(s,t) of 28 colloidal point patterns, (b) Pointwise variance in the (3\ rank functions, (c) First 
principal component function. Contours in (a) and (b) match the tick marks on the colour bar, contours in (c) are zbO.Ol, ±0.03,.... 
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Fig 9. Principal component score relationships for the 28 colloidal point patterns, (a) First principal component scores versus effective 
temperature for 0-dimensional persistent homology, (b) First versus second p.c. scores for fo. (c) First principal component scores 
versus effective temperature for 1-dimensional persistent homology, (d) First versus second p.c. scores for (3\. The colours denote the 
effective temperature groups. 
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Fig 10. First principal component scores in dimension 0 versus dimension 1. Colour again denotes the effective temperature groups as 
in Fig. 9. 
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6. Analysis of Sphere Packings 

In this section we apply the functional PC A methods outlined in Section 4 to three-dimensional point patterns 
obtained from packings of spherical acrylic beads in cylindrical containers. 2 The points are the centers of the beads 
and the data sets are non-overlapping cubical subsets interior to the container. We find that the rank functions 
/?i(s, £), /^(s, t) computed from alpha shape filtrations each have around 97% of their variation explained by the 
first principal component and a further 2% when adding the second principal component. The scores against the 
first principal component relate strongly to the packing fraction (j) of the bead-packs. This supports the physical 
assumption that packing fraction is the dominant parameter influencing the configurations of bead packings. 

Monosized spherical bead packings are an idealised model of granular materials. In the language of statistical 
physics, they are athermal , dissipative and non-equilibrium systems and so fail to conform to the requirements 
of standard equilibrium statistical physics models. Despite this, disordered bead packings do have reproducible 
distributions of localised quantities that appear to depend on just a few macroscopic parameters, and there have 
been a number of statistical models since Edwards first proposed that volume could play the role of energy [30] , see 
the recent review [31]. For example, the distribution of localised volumes such as those of Voronoi cells or Delaunay 
tetrahedra have theoretically-derived distributions depending on the global volume fraction [32]. 

One of the main physical observations about monosized spherical bead packings is that although a maximally 
dense packing of beads has </> =» 0.74 [33], there appears to be a random close packing ‘limit’ at = 0.64 in the 
sense that gentle tapping of a container of beads results in this packing fraction, and to obtain a denser packing 
requires significantly more energy to be injected into the system. A maximally dense packing of beads is built from 
hexagonally close-packed layers that contain local packing configurations of regular tetrahedra and octahedra in a 
2:1 ratio, whereas packings with < 0.64 are now understood as being built of beads in quasi-regular tetrahedral 
configurations. The geometry of local bead configurations in large packings has been explored extensively using the 
Voronoi partition and the dual Delaunay complex built from the bead center points. Results for simulated hard 
sphere packings in [34, 35] show that packings with 0 > 0.645 show a significant increase in the number of Delaunay 
tetrahedra with quarter-octahedral shape. Packings with </> « 0.64 have also been shown to have almost all beads 
belonging to face-sharing quasi-regular tetrahedral configurations [36, 37]. 

In this section, we capture the local configurations of beads via persistent homology diagrams derived from 
alpha-shape (Cech) filtrations of the Delaunay complex. This means the analysis is closely related to the above 
techniques, but the advantage is that persistent homology detects correlations between adjacent Delaunay tetra¬ 
hedra in a geometrically and topologically meaningful way. For example, rather than testing for the presence of 
octahedra indirectly by computing a shape parameter to select quarter-octahedra, the persistence information cap¬ 
tures an octahedral configuration unambiguously by a point in PD2 with (6, d) = (1.15,1.41). The birth value 
for the PD2 point is the circumradius of a face of the octahedron, i.e., an equilateral triangle, and the death 
value is the circumradius of the octahedron with vertices at the centers of beads of radius 1. This range of radii 
(1.15,1.41) is exactly that for which the union of alpha-shape balls enclose a void inside the octahedron. For a 
regular tetrahedron, the corresponding range is (1.15,1.22). Further details about the information contained in 
persistence diagrams are given below. Other recent work applying persistent homology to study granular packings 
has considered two-dimensional packings of disks, examining force chains [38], the dependence of Betti numbers on 
preparation protocol, [39], and the structure of the full configuration space of a small number of hard disks [40]. 

The data used in this section were obtained from three experiments imaged with the x-ray micro-CT facility at 
the Australian National University [41, 37]. The experiments each used monosized (within 2.5%) acrylic beads with 
a mean diameter of 1.0mm that were poured into large cylindrical containers (inner diameter of 66mm) with over 
100,000 beads in each case. To create the low-density packing, A , with overall packing fraction </> = 0.59, a glass 
rod was placed in the cylinder as the beads were poured in and then gently removed. The second experiment, B , 
imaged a random close packing of beads (</> = 0.63) after pouring them into the container and then tapping gently 
to settle. In the third experiment, ( 7 , a loose lid was placed on top a random close packing of beads, the container 
was shaken intensely for a few seconds to the point of fluidisation and partial crystallisation was observed with 
resulting = 0.70. Each container of beads was imaged using micro-CT and the resulting volume maps of x-ray 
density were processed to extract the coordinates of each bead centre and its radius to within 1.5/im. 

Our first step in analysing this data is to scale the units of length in each experiment by the mean bead radius 
of the sample. We then take disjoint cubical subvolumes of equal sizes from the three experimental data sets and 
compute the persistence diagrams of alpha-shape filtrations derived from the 3D point patterns of bead centres using 
the dionysus package [12]. Persistence diagrams for subsets of between 3000 to 3800 beads from each experiment are 
shown in Fig. 11. These diagrams illustrate a number of interesting features in the geometry of bead packings. We 

2 We thank Mohammad Saadatfar for providing us with the data. 
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begin by noting that the persistence diagrams for O-dimensional homology simply record the number of beads in 
the sample and the fact that they are all in contact at the mean bead radius. The mean bead radius does not vary 
significantly across the different subsets so we do not use the PDO data further. PD1 points tell us about the geometry 
of shortest cycles: birth values record the longest edge-length in a cycle of bead contacts or near-contacts, the death 
value of each cycle is the circumradius of the largest empty triangle within the span of that cycle. The PD1 diagrams 
in Fig. 11 have three main features. The first is a “hot spot” around (6, d) = (1.0,1.15) = (1, ^), corresponding to 
an equilateral triangle configuration of beads. Then there are two approximately one-dimensional features extending 
from the hot spot, one with b « 1.0, d < 1.4 and the other along an arc reaching up to (6, d) = (1.41,1.41), the 
signature of a right-angled triangle with hypotenuse equal to 2 \[2. Cycles with b ~ 1.0 are due to beads in contact. 
Given the 2.5% polydispersity in the bead radii, we conclude that those PD1 points with b ~ 1.0 and d > 1.18 must 
be due to (minimal) cycles with four or more beads in contact. Such cycles have a significant presence in the random 
loose and close packings, A , F>, but are dramatically reduced in the densest packing, C. The cycles along the arc from 
b = 1.0 to 1.41 are due to triangular configurations with longest edge = 2b. Again, these have a strong signature in 
A and F>, but not in the densest packing C. The dense packing however, has a second hot spot at (5, d ) = (1.41,1.41) 
due to right-angled triangles in octahedral bead configurations. The presence of octahedral bead configurations is 
shown extremely clearly in the PD2 diagram of the dense packing, C . This diagram has three hot spots: one at the 
tetrahedral point (1.15,1.22) = (^, ^), one at the regular octahedral point (1.15,1.41) = (^,2\/2) and one at 
the right-angled tetrahedral point (1.41,1.41). PD2 diagrams of the random bead packings are significantly different 
from that of the dense packing. The random packings, A, B , show an arc of PD2 points extending from the regular 
tetrahedron up to the right-angled tetrahedron that can be bounded below by tetrahedral configurations with a 
single long edge, and bounded above by tetrahedra with two long edges (opposite one another). Regular octahedral 
configurations are not present in the random packings, but there is a diffuse cloud of points with 1.2 < b < 1.4, 
d < 2.0. These persistence points are related to the less dense regions of the packings. A comprehensive analysis of 
the information encoded in the persistence diagrams of real bead packings is given in another paper [42] . 



Fig 11. Persistence diagram histograms from a 28.0 3 subset of each experiment. (a),(b),(c) PD1 for subset volume fractions, = 
0.58,0.63,0.72. (d),(e),(f) PD2 for the same subsets. 


To test the apparent dependence of persistence diagrams on volume fraction, we computed 1- and 2-dimensional 
rank functions for 12 cubical subsets with an edge length of 28.0 units from each of the three experiments, calculated 
their pointwise mean and variance and performed PC A. Mean and variance functions for /3i(s,t) and /^(s, t) per 
unit volume are shown in Fig. 12 and again highlight the importance of the tetrahedral and octahedral bead 
configurations. The large number of equilateral triangle configurations generates a sharp corner at (s, t ) = (1.0,1.15) 
in both the mean and variance of /3i(s,£), and the regular tetrahedral and octahedral configurations produce 
significant steps in the mean and variance of /^(s, t) at (1.15,1.22) and (1.15,1.41) respectively. 

The PC A analysis of these 36 subsets shows that 97.7% and 97.1% of the variation in the 1- and 2-dimensional 
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Fig 12. Mean and variance of 1- and 2-dimensional persistent homology rank functions per unit volume from 36 cubical subsets with 
side l = 28.0. (a) Mean rank for (3i(s,t)/l 3 , (b) variance function for (3\ (s,t)/l 3 , (c) mean rank for fa(s,t)/l 3 , (b) variance function 
for (32(s,t)/l 3 . Contour lines match the tick marks on the colour scale in each plot. 


rank functions is explained by a single axis, 99.2% and 99.4% by two-dimensional subspaces respectively. The first 
and second principal component functions are illustrated in Fig. 13. Again, the first principal component functions 
highlight the importance of the regular tetrahedral and octahedral configurations, while the second principal com¬ 
ponent functions appear to be correcting areas where the first p.c. functions change rapidly. In particular the second 
p.c. functions seem to be related to the variation of regularity amongst the triangles (in dimension 1) and amongst 
the tetrahedra and octahedra (in dimension 2) which may be partly explained by the polydispersity in bead radii. 

We project each subset rank function onto these two principal components and see that the scores cluster 
according to the subset volume fractions, see Fig. 14. The first p.c. scores show a particularly strong relationship 
with packing fraction. The results for fa (s,t) in particular suggest a sharp transition at the Bernal density of 
< j) = 0.64. The second p.c. scores are not directly related to the subset volume fractions, but do show some trend 
within each of the three experiments. 

The next step in our analysis is to use the structure in the principal component functions to derive simpler order 
parameters for the bead packings. The essential idea is that the locations of the maxima or (negative) minima in 
the principal component functions indicate the most important geometric parameters for capturing the variation in 
the rank functions. We start with the first principal component for fa (s,t), Fig. 13(b). The maximum value occurs 
as a fairly flat triangular region below the tetrahedral point (1.15,1.22). We tested the correlation between fa (s,t) 
and first p.c. scores for values of (s, t) in this region and found the best linear relationship between them when (s, t) 
= (1.20,1.20), see Fig. 15. Geometrically, the value of fa = fa(l- 20,1.20) simply counts the number of enclosed 
voids in the alpha-shape at radius = 1.20. For the bead packings, this means it counts configurations of beads that 
are close to being regular tetrahedra and regular octahedra. This parameter, fa is therefore similar to the various 
measures of “tetrahedrisity” and “quartoctahedrisity” discussed in [34] . Our analysis confirms that these quantities 
are the most important measures of variation in the local configurations of bead packings. 

The second principal component for fa(s,t) has a maximum in a narrow strip around s = 1.16, 1.16 < t < 1.22 
and a minimum near s = t = 1.235. This suggests an appropriate index is the difference = fa(l- 167,1.167) — 
7^2(1.235,1.235). As shown in Fig. 15, this index has two linear scaling relationships with second p.c. scores, for 
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Fig 13. Depiction of the first and second principal component functions for 1- and 2-dimensional persistent homology rank functions 
from 36 cubical subsets with side 28.0. (a) First principal component for (3\, (b) first principal component for fa, (c) second principal 
component for (3\, (d) second principal component for (62- The principal component functions are scaled to have their inner product 
norm = 1. Contours are drawn at levels h = —15, —13,..., —1, 1,. .. , 13,15 in each plot. 


bead packs above (subsets of packing C) and below (subsets of A and B ) the Bernal limit, </> = 0.64. The value 
of /^2 (1.167,1.167) counts the tetrahedra and octahedral configurations that are very close to being regular in the 
sense that all their faces have circumradii within 1% of that for an equilateral triangle. The alpha-shape at radius 
1.235 has filled in all the regular tetrahedra, so (1-235,1.235) counts some quasi-regular tetrahedra, the regular 
octahedra and some quasi-regular octahedral configurations. The difference index D 2 therefore counts the highly 
regular tetrahedra, minus some quasi-regular tetrahedral and octahedral configurations. The physical significance 
of this index is unclear, but the two different linear scaling regimes suggests there are different processes relevant 
to the packings above and below the Bernal limit. 

A similar analysis of the 1-dimensional rank functions supports the findings above. The first principal component 
for /?i(s, t) has a fairly flat maximal region below the equilateral triangle point at (1.0,1.15). The values of R\ = 
Pi (1.05,1.05) have a clear linear relationship with the first p.c. scores, see Fig. 15. The quantity R\ is analogous 
to the 3-point correlation function analysed in [35]. The structure in the second p.c. for Pi(sR) again suggests an 
appropriate index could be the difference D% = /5i(1.028,1.150) — /3i(1.00,1.13). As for D 2, we see multiple roughly 
linear scaling relations for subsets from different experiments. 
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Fig 14. First and second principal component scores for the 36 rank functions, (a) First versus second p.c. scores for (3\(s,t). (b) 
Volume fraction versus first p.c. score for (3\(s,t). (c) Volume fraction versus second p.c. score for (3\(s,t). (d) First versus second 
p.c. scores for /^(s, t). (e) Volume fraction versus first p.c. score for ^(s, t). (f) Volume fraction versus second p.c. score for /^(s, t). 
Blue dots are data for subsets from experiment A, green dots from B and red dots are from the third, partially crystallised, packing C. 
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Fig 15. Geometric indices that correlate with first and second p.c. scores, (a) R\ = /3i(1.05,1.05) versus first p.c. scores of /3i(s,t) 
for the 36 subsets, (b) Volume fraction versus R\. (c) D\ versus second p.c. score for (3\(s,t). (d) R 2 = /^ 2 (1.20,1.20) versus first p.c. 
scores of p 2 (s,t) for the 36 subsets, (e) Volume fraction versus R 2 . (f) D 2 versus second p.c. score for fa(s,t). Blue dots are data for 
subsets from experiment A, green dots from B and red dots are from the third, partially crystallised, packing C. 
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7. Conclusions and future directions 

This paper has shown how the persistent homology rank function can be used as a summary statistic for distin¬ 
guishing between spatial point processes generated by different models, and as the basis for a functional principal 
component analysis of point patterns arising in physical systems. Our results establish the basic techniques needed 
for further applications. For example, the work in Section 3 could be extended to powerful topological summary 
statistics for spatial point patterns by using PC A or a machine learning approach for parameter estimation to 
determine which model a point pattern is likely to be drawn from. In this paper we limited ourselves to Cech com¬ 
plexes when building the filtrations for persistence computations. It would be interesting to use nitrations derived 
from kernel density functions of point patterns as we believe this would be particularly powerful when looking at 
clustering models. 

Our analysis of point patterns from colloids and sphere-packings can be extended to packings of non-spherical 
grains, or to any scalar function of interest via the connection that exists between Morse theory and persistent 
homology [43]. As mentioned in Section 1, a general method for building a filtration of a space X is via the sublevel 
sets of a continuous function / : X —» M. To study the geometry of packings of non-spherical grains or any porous 
material, we can use the distance function defined by the shortest distance from each point in space to the surface 
of the grains. Other physical applications might study a function that specifies the local concentration or density of 
a particular phase, an energy potential, and so on. In this context a Gaussian random field plays the null hypothesis 
role analogous to a completely spatially random point process. 

Finally, there remain theoretical questions about analysis using persistent homology rank functions to explore. 
These include how robust the pairwise distances between the rank functions are under different choices of the 
weighting function </> (2.2) and whether there is an optimal choice of (f under various scenarios. We also conjecture 
there are convergence properties for persistent homology rank functions constructed from point patterns over larger 
and larger domains after normalizing by volume. This should be particularly apparent where the point patterns 
in distant regions are independent. Complications arise from both boundary effects and long-range dependence, 
however. Even when the point patterns in disjoint regions are independent there can be homology classes that involve 
long-range behaviour. This fact is one of the main obstacles to the statistical analysis of topological quantities, but 
also the reason for their sensitivity to physically-relevant non-local properties. 
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